Abstract

Machine learning methods have recently contributed significantly to
solving multidisciplinary problems in cancer classification using
microarray gene expression data. Feature subset selection methods can
play an important role in the modeling process, since these tasks are
characterized by a large number of features and few observations,
making the modeling a non-trivial undertaking.

In this particular scenario, it is extremely important to select genes
by taking into account the possible interactions with other gene
subsets. This paper shows that, by accumulating the evidence in favour
of (or against) each gene along the search process, the obtained gene
subsets may constitute better solutions, either in terms of predictive
accuracy or gene subset size, or both. The proposed technique is
extremely simple and applicable at a negligible overhead in cost.



1 Introduction 

In recent years, research in feature subset selection (FSS) has become
a hot topic, boosted by the introduction of new application domains and
the growth of the number of features involved [Liu and Motoda, 1998].
An example of these new domains is web page categorization, a domain
currently of much interest for internet search engines, where thousands
of terms can be found in a document. Another example is cancer
classification by gene expression using DNA microarrays, a domain where
machine learning methods are now extensively used [Duan et al., 2005].
Problems with many features and a limited number of observations are
also very common in molecule classification or medical diagnosis, among
others.

The selection of a new feature (either to be removed or added to the
current set) involves the evaluation of many models. These models
typically consist of the addition (deletion) of one feature to (from)
the current set. In wrapper methods, an inducer is called to build
temporary solutions and return their evaluation using some resampling
method (e.g. cross-validation) [Kohavi and John, 1997].

In the standard procedure, only the best such model evaluation is
considered for selecting which feature should be removed or added, and
the remaining evaluations are readily discarded. Yet there is valuable
information in the discarded evaluations: the very many evaluated
subsets contain information on the relevance of the features that
belong to the subset; this relevance does not depend on the subset
being selected or not. When an inducer is requested to estimate the
predictive accuracy of a model using a given feature subset within a
wrapper strategy, no indication is given on which feature is the most
recent addition (or deletion): the inducer just sees a feature subset
which has to be evaluated as a whole.

Since the most difficult part of an FSS process is to evaluate the
interactions between features, the accumulated evaluation of a feature
in diverse contexts should account for many of these interactions, and
ultimately provide a more informed estimation of usefulness for the
chosen inducer. The different contexts of a particular feature x are
given by all those subsets which are being evaluated along the search
process (not necessarily to assess the influence of x noted above),
either containing or not containing x.

Our idea is to accumulate the inducer evalua- 
tions as a rich source of information. This informa- 
tion can then be used in conventional existing algo- 
rithms, such as the well-known forward or backward 
selection. This idea can be applied to any sequen- 
tial search algorithm and any inducer and, as shown 
below, at a negligible extra cost. 

In this paper we present experimental results 
showing good performance in a suite of bench- 
mark microarray problems. The proposed modifi- 
cation always achieves improvements when applied 
to standard backward selection, either in the esti- 
mated predictive accuracy, in the size of the deliv- 
ered gene subsets, or in both. 



2 Accumulated Evidence in Feature Subset Selection

2.1 Preliminaries

It is common to see feature subset selection (FSS) in a set $Y$ of size
$n$ as a search problem where the search space is the power set of $Y$,
$\mathcal{P}(Y)$ [Langley, 1994]. Each state in the search space
corresponds to a subset of features. Exhaustive search is usually
intractable, and methods to explore the search space efficiently must
be employed. These methods are often divided into two main categories:
filter methods and subset selection methods. A major disadvantage of
filter methods is that they are performed independently of the
classifier, and the same set of features need not be optimal for
different classifiers. Most filter methods disregard the dependencies
between features, as each feature is considered in isolation.

Without loss of generality, it can be assumed that the evaluation
measure $J : \mathcal{P}(Y) \to \mathbb{R}^+ \cup \{0\}$ is to be
maximized. In this setting, the problem is to find the optimal subset
$X \in \mathcal{P}(Y)$ as the one maximizing $J$. The evaluation measure
may be inducer-independent (as in filter methods) or may be the same
inducer being used to solve the task (as in wrapper methods). In either
case, we will refer to $J_{\mathcal{L}}(X)$ as the usefulness of
$X \subseteq Y$ estimated using the inducer $\mathcal{L}$. Since the
inducer evaluation in a sample varies depending on the resampling
method used, we prefer to use the notation $J_{\mathcal{L}}(X)$ instead
of simply $\mathcal{L}(X)$ to express such evaluation.

In the literature, several suboptimal algorithms have been proposed for
doing this. Among them, a wide family is formed by those algorithms
which, departing from an initial solution, iteratively add or delete
features by locally optimizing the objective function. The search
starts with an arbitrary set of features (e.g. the full set or the
empty set) and moves iteratively to neighbor solutions by adding or
removing features. Among the most used algorithms for this problem are
the sequential forward generation (SFG) and sequential backward
generation (SBG), their generalization plus l - take away r, or
PTA(l, r) [Stearns, 1976]^1, or the floating search methods
[Pudil et al., 1994]. These latter algorithms work by combining SFG and
SBG steps.



2.2 Accumulated evidence and fea- 
ture relevance 

The idea consists in accumulating the evidence in favor of or against a
feature, taking into account its history of evaluations alongside
different feature subsets. Put another way, the aim is to make the most
of every subset evaluation, normally the most costly part of an FSS
process.

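Let $Y_x = \{X \in \mathcal{P}(Y) \mid x \in X\}$ be the set of all
feature subsets of the initial set that contain a certain feature $x$
(note that $|Y_x| = 2^{n-1}$ for all $x \in Y$).

Let $\mathcal{L}^+_x$ and $\mathcal{L}^-_x$ be the average evaluation of
all subsets containing and not containing $x$:

$$\mathcal{L}^+_x = \frac{1}{2^{n-1}} \sum_{X \in Y_x} J_{\mathcal{L}}(X), \qquad \mathcal{L}^-_x = \frac{1}{2^{n-1}} \sum_{X \notin Y_x} J_{\mathcal{L}}(X)$$

Given an inducer $\mathcal{L}$ (either filter or wrapper) define, for a
given feature $x \in Y$, the relevance of $x$ as:

$$R_{\mathcal{L}}(x) = \mathcal{L}^+_x - \mathcal{L}^-_x \qquad (1)$$

The above definition can be more compactly expressed as:

$$R_{\mathcal{L}}(x) = \frac{1}{2^{n-1}} \sum_{X \in \mathcal{P}(Y \setminus \{x\})} \left( J_{\mathcal{L}}(X \cup \{x\}) - J_{\mathcal{L}}(X) \right) \qquad (2)$$

1 SFG is PTA(1,0) and SBG is PTA(0,1).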
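As an illustration of why the two forms of the relevance agree, the following minimal sketch evaluates both by brute force on a tiny feature set. The names `powerset`, `relevance_eq1`, `relevance_eq2` and the evaluation measure `toy_J` are hypothetical and not part of the paper; enumeration is only affordable for very small n.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def relevance_eq1(J, features, x):
    """Eq. (1): mean of J over subsets containing x minus mean over the rest."""
    with_x = [J(S) for S in powerset(features) if x in S]
    without_x = [J(S) for S in powerset(features) if x not in S]
    return sum(with_x) / len(with_x) - sum(without_x) / len(without_x)

def relevance_eq2(J, features, x):
    """Eq. (2): average gain of adding x to each subset that lacks it."""
    others = [f for f in features if f != x]
    gains = [J(S | {x}) - J(S) for S in powerset(others)]
    return sum(gains) / len(gains)

# Hypothetical evaluation measure: features 0 and 1 only help together.
def toy_J(S):
    return 1.0 if {0, 1} <= S else 0.5

features = [0, 1, 2]
for x in features:
    print(x, relevance_eq1(toy_J, features, x), relevance_eq2(toy_J, features, x))
```

In this toy setting the interacting features 0 and 1 both receive a positive relevance, while the irrelevant feature 2 receives zero, under either formula.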

Remark 1. Defining feature relevance with expression (2) is very
attractive, since it captures feature interactions in all possible
ways. We take the freedom of presenting an informal but hopefully
illustrative analogy of what this measure captures. Imagine we want to
evaluate the average influence of a basketball player on a team's
scoring: we can compute the difference in points that the team scores
with and without this player, no matter which other players are playing
in the player's team. If this difference is positive, then we can
conclude that this player's contribution is positive for the team;
otherwise we conclude that we had better sell the player at the best
possible price! Note that in this example, only subsets $X$ of size 4
are considered and $Y \setminus X$ is the bench.^2

Remark 2. Full evaluation of expression (2) has an exponential cost in
$n$, making it infeasible for most practical applications; an
estimation is therefore mandatory via Monte Carlo techniques,
generating feature subsets randomly from a precise probability
distribution determined by the FSS algorithm being used. Oddly,
although $R_{\mathcal{L}}(x)$ takes into account all possible feature
interactions, by its very nature it does not capture redundancy: two
identical features will have the same relevance. This is true even if
$J_{\mathcal{L}}$ is made to cope with redundancy. However, since a
search algorithm will impose an order on the evaluated feature subsets,
the current state can be used to ascertain redundancy, as will be shown
below.
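The two points of this remark can be illustrated with a small Monte Carlo sketch. It assumes subsets are drawn uniformly at random, which is only one possible sampling distribution (the text ties the distribution to the FSS algorithm in use). The function `mc_relevance` and the measure `toy_J` are hypothetical names introduced for illustration.

```python
import random

def mc_relevance(J, features, x, n_samples, rng):
    """Estimate R(x) by averaging J(S + x) - J(S) over random subsets S without x."""
    others = [f for f in features if f != x]
    total = 0.0
    for _ in range(n_samples):
        # Assumption: each other feature is included with probability 1/2,
        # i.e. S is uniform over the subsets of Y minus {x}.
        S = frozenset(f for f in others if rng.random() < 0.5)
        total += J(S | {x}) - J(S)
    return total / n_samples

# Hypothetical measure that copes with redundancy: features 0 and 1 are
# duplicates of one useful signal, and a second copy adds nothing.
def toy_J(S):
    return 1.0 if (0 in S or 1 in S) else 0.5

rng = random.Random(0)
features = list(range(6))
estimates = {x: mc_relevance(toy_J, features, x, 20000, rng) for x in features}
print(estimates)
```

Both duplicated features obtain roughly the same estimated relevance (the exact value is 0.25 in this toy case), even though keeping one of them is enough: the measure ranks them equally and cannot, by itself, flag the redundancy.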

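The above expressions can be conveniently generalized by considering a
weighting function $w$:

$$R_{\mathcal{L}}(x) = \frac{1}{2^{n-1}} \sum_{X \in \mathcal{P}(Y \setminus \{x\})} \left( J_{\mathcal{L}}(X \cup \{x\}) - J_{\mathcal{L}}(X) \right) w_x(X) \qquad (3)$$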

2 Incidentally, this way of ranking players (together with 
rebounds, assists, etc) is used in the NBA. 



For example, the choice $w_x(X) = |X|/|Y| = |X|/n$ gives more
importance to improvements in $J_{\mathcal{L}}$ achieved in a scenario
with already many features (improving performance in such a case has a
certain merit); alternatively, one could choose
$w_x(X) = J_{\mathcal{L}}(X)$; this choice expresses the belief that an
improved performance when $J_{\mathcal{L}}(X)$ is already high should
be rewarded, and less so when it is low (it has a much lower merit).
Many alternatives are possible and the best one (if such a choice
exists at all) is at the moment an open question. Note that eq. (3)
reduces to eq. (2) when $w_x(X) = 1$ for all $x$.

In the following, we present a practical method to approximate this
measure of relevance and integrate it in an SBG search algorithm at no
additional cost. The idea consists in accumulating the evidence in
favor of or against a feature by taking into account the history of
evaluations throughout the search process.

2.3 Practical computation of the ac- 
cumulated evidence 

Let $X_k$ denote the current set, where $|X_k| = k$, for notational
simplicity (thus $X_0 = \emptyset$ and $X_n = Y$); let $X_{n-k}$ be the
set of features not in $X_k$, i.e. $X_{n-k} = Y \setminus X_k$. Assume
first we are about to perform a forward step. Given $X_k$, in a
classical SFG, the set

$$\{ J_{\mathcal{L}}(X_k \cup \{x\}) \mid x \in X_{n-k} \} \text{ is computed} \qquad (4)$$

and the feature $x' = \arg\max_{x \in X_{n-k}} J_{\mathcal{L}}(X_k \cup \{x\})$
is selected. However, all the remaining information:

$$\{ J_{\mathcal{L}}(X_k \cup \{x\}) \mid x \in X_{n-k},\ x \neq x' \} \text{ is discarded,} \qquad (5)$$

yet sometime in the future these individual fea- 
tures x (and eventually x' itself) will be considered 
again for inclusion or exclusion from the current set 
in forward or backward steps, respectively. 

Conversely, in a backward step the search algorithm is going to
evaluate a feature $x$ for possible exclusion from $X_{n-k}$ in such a
way that the set

$$\{ J_{\mathcal{L}}(X_{n-k} \setminus \{x\}) \mid x \in X_{n-k} \} \text{ is computed} \qquad (6)$$

and the feature $x' = \arg\max_{x \in X_{n-k}} J_{\mathcal{L}}(X_{n-k} \setminus \{x\})$
is selected for removal. Again, the information:

$$\{ J_{\mathcal{L}}(X_{n-k} \setminus \{x\}) \mid x \in X_{n-k},\ x \neq x' \} \text{ is discarded.} \qquad (7)$$

Yet, sometime in the future these individual fea- 
tures x (and eventually x' itself) will be considered 
again for inclusion or exclusion from the current set 
in forward or backward steps, respectively. Rea- 
soning in more general terms, the search algorithm 
always evaluates a feature x for possible inclusion 
in (or exclusion from) the current subset using in- 
formation about x. 

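Now let $P_{\mathcal{L}}$ denote the set of feature subsets that the
search algorithm has evaluated so far (implying a call to
$\mathcal{L}$). Let $P_{\mathcal{L}|x} = \{X \in P_{\mathcal{L}} \mid x \in X\}$.
For every $x \in Y$, define the accumulated evaluations (or simply
accumulators) as the Monte Carlo estimations:

$$\hat{\mathcal{L}}^+_x = \frac{\sum_{X \in P_{\mathcal{L}|x}} J_{\mathcal{L}}(X)\, w_x(X)}{\sum_{X \in P_{\mathcal{L}|x}} w_x(X)} \qquad (8)$$

$$\hat{\mathcal{L}}^-_x = \frac{\sum_{X \notin P_{\mathcal{L}|x}} J_{\mathcal{L}}(X)\, w_x(X)}{\sum_{X \notin P_{\mathcal{L}|x}} w_x(X)} \qquad (9)$$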



which are approximations to the weighted versions of $\mathcal{L}^+_x$
and $\mathcal{L}^-_x$, respectively. These two approximated values
depend on the search algorithm, which determines the strategy to
traverse the search space. Different FSS algorithms (such as SFG or
SBG) provide different traces of evaluated subsets at any given number
of algorithmic steps. In these conditions, the impact of the considered
feature in the current subset $X$ can be used to ascertain redundancy
and make it influence the search, by modulating the effect of the
accumulated evaluations. Consider now, for $\lambda \in [0, 1]$,

$$\frac{\lambda}{2} \left( \hat{\mathcal{L}}^+_x - \hat{\mathcal{L}}^-_x + 1 \right) + (1 - \lambda)\, J_{\mathcal{L}}(x), \qquad (10)$$

where $J_{\mathcal{L}}(x) = J_{\mathcal{L}}(X \setminus \{x\})$ in a
backward step (the effect of removing $x$ from $X$) and
$J_{\mathcal{L}}(x) = J_{\mathcal{L}}(X \cup \{x\})$ in a forward step
(the effect of adding $x$ to $X$) and $\lambda$ is a free parameter.
This scheme generalizes conventional forward and backward steps (as
used by SFG, SBG or any other sequential algorithm) in two ways:

1. By setting $\lambda = 0$, the conventional forward and backward
   steps are recovered and both relevance and redundancy are evaluated
   using $J_{\mathcal{L}}(x)$. By setting $\lambda = 1$, a pure
   arithmetic average between $\hat{\mathcal{L}}^+_x$ and
   $1 - \hat{\mathcal{L}}^-_x$ is computed.

   For other values of $\lambda$, the search history influences the
   search itself, conditioning the selection of features. In this case,
   only a $1 - \lambda$ fraction of the importance is assigned to the
   current subset evaluation.

2. The search history itself is formed by all known contexts in which
   the considered feature could appear or not (and not only by previous
   evaluations of the feature), thus forming a broader picture of its
   true relevance.
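The two limiting cases of expression (10) can be checked numerically with a minimal sketch. The function name `accumulated_score` and the sample values are hypothetical; the sketch also assumes that the evaluations lie in [0, 1] (for instance, accuracies), in which case the history term is itself a number in [0, 1].

```python
def accumulated_score(lam, l_plus, l_minus, j_current):
    """Expression (10): blend the search history with the current evaluation.

    lam       -- free parameter in [0, 1]
    l_plus    -- accumulated (normalised) evaluation of subsets containing x
    l_minus   -- accumulated (normalised) evaluation of subsets not containing x
    j_current -- evaluation of the subset obtained by adding or removing x
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (lam / 2.0) * (l_plus - l_minus + 1.0) + (1.0 - lam) * j_current

# Made-up values for one feature x.
print(accumulated_score(0.0, 0.75, 0.5, 0.75))  # history ignored: plain J
print(accumulated_score(1.0, 0.75, 0.5, 0.75))  # history only: (L+ + 1 - L-) / 2
```

With `lam = 0` the score is the current evaluation alone, and with `lam = 1` it is the arithmetic average of the accumulated evaluation with the feature and one minus the accumulated evaluation without it.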

Example. Consider the following feature subset mask ($n = 20$) for a
current feature subset $X_8 \subset Y$ where the $i$-th index is 1 when
feature $x_i \in X_8$ and 0 otherwise:

10010010001010100101

signaling the presence of features number 1, 4, 7, etc. An evaluation
$J_{\mathcal{L}}(X)$ of this subset is indeed expressing how good it is
to have the first feature but not the second or the third, also how
good it is to have the seventh feature but not the one before the last,
and so forth. For this reason, all the features in $Y$ (and not only
those in $X$) should have their accumulators updated every time.
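The bookkeeping implied by this example can be sketched as follows, with unit weights. The class `Accumulators`, its attribute names, and the score 0.9 are hypothetical; treating a side with no evaluations as 0 is also an assumption made only to keep the sketch total.

```python
class Accumulators:
    """Running sums and counts of evaluations, per feature, with unit weights."""

    def __init__(self, features):
        self.features = list(features)
        self.sum_plus = {x: 0.0 for x in self.features}
        self.sum_minus = {x: 0.0 for x in self.features}
        self.n_plus = {x: 0 for x in self.features}
        self.n_minus = {x: 0 for x in self.features}

    def update(self, subset, score):
        """Record one evaluation J(subset) = score for every feature in Y."""
        for x in self.features:
            if x in subset:
                self.sum_plus[x] += score
                self.n_plus[x] += 1
            else:
                self.sum_minus[x] += score
                self.n_minus[x] += 1

    def relevance(self, x):
        """Estimated L+ minus L-; a side with no evaluations counts as 0."""
        plus = self.sum_plus[x] / self.n_plus[x] if self.n_plus[x] else 0.0
        minus = self.sum_minus[x] / self.n_minus[x] if self.n_minus[x] else 0.0
        return plus - minus

mask = "10010010001010100101"
subset = {i + 1 for i, bit in enumerate(mask) if bit == "1"}  # 1-based numbering
acc = Accumulators(range(1, len(mask) + 1))
acc.update(subset, 0.9)  # 0.9 is a made-up evaluation of this subset
print(sorted(subset))
```

A single call to `update` touches all 20 features: the 8 present ones have the score added to their "with" accumulator, the 12 absent ones to their "without" accumulator, and no extra call to the inducer is needed.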

3 A practical algorithm 

We illustrate the approach on the popular SBG search algorithm
(Algorithm 1) and give a practical implementation of the previous ideas
for it (SBG+, Algorithm 2). In addition, for simplicity of
presentation, we fix $w_x(X) = 1$. In this case, normalization simply
amounts to a division by the number of performed accumulations. The
initialization of the accumulated relevances is 0 for all $x \in Y$.
The results are first accumulated and then used; for this reason, even
in the first algorithmic step (the first discarded feature) the
behavior of both algorithms may start to diverge. At the end of the FSS
process, $n^+_x$ (resp. $n^-_x$) will be the number of times that a
feature subset containing (resp. not containing) $x$ has been
evaluated. Note that the computation is done at a negligible overhead
in cost; this is due to the fact that the inducer is called exactly the
same number of times for SBG as for the accumulated counterpart SBG+.
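For readers who prefer code, a minimal sketch of plain sequential backward generation is given here, with a counter that makes the number of evaluation calls explicit. The function `sbg` and the measure `toy_J` are hypothetical; unlike the description in the text, this sketch evaluates the full set once at the start and stops when a single feature remains, so the empty set is never evaluated.

```python
def sbg(J, features):
    """Sequential backward generation; returns (best_subset, best_score, n_calls)."""
    calls = 0

    def evaluate(S):
        nonlocal calls
        calls += 1
        return J(S)

    current = frozenset(features)
    best_subset, best_score = current, evaluate(current)
    while len(current) > 1:
        # Evaluate every subset obtained by dropping one feature.
        candidates = [(evaluate(current - {x}), x) for x in sorted(current)]
        # Keep the best reduced subset; ties go to the first feature in order.
        score, removed = max(candidates, key=lambda c: c[0])
        current = current - {removed}
        if score > best_score:
            best_subset, best_score = current, score
    return best_subset, best_score, calls

# Hypothetical measure: features 0 and 2 are useful, every feature has a unit cost.
def toy_J(S):
    return 10 * len(S & {0, 2}) - len(S)

best, score, calls = sbg(toy_J, range(4))
print(sorted(best), score, calls)
```

An accumulated variant would reuse exactly the scores collected in `candidates`, so its number of calls to the evaluation measure would be the same; only the rule that picks `removed` would change.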

Algorithm 1 SBG (inducer $\mathcal{L}$, feature set $Y$)

  $X_n \leftarrow Y$
  $k \leftarrow 0$
  repeat
    for all $x \in X_{n-k}$ do
      compute $J_{\mathcal{L}}(X_{n-k} \setminus \{x\})$
    end for
    $x' \leftarrow \arg\max_{x \in X_{n-k}} J_{\mathcal{L}}(X_{n-k} \setminus \{x\})$
    $X_{n-k-1} \leftarrow X_{n-k} \setminus \{x'\}$
    $k \leftarrow k + 1$
  until $k = n$
  return $\arg\max_{k = 1, \dots, n} J_{\mathcal{L}}(X_k)$

Algorithm 2 SBG+ (inducer $\mathcal{L}$, feature set $Y$)

  $X_n \leftarrow Y$
  $k \leftarrow 0$
  {Initialize accumulators and counters}
  $\forall x \in Y$: $\mathcal{L}^+_x \leftarrow \mathcal{L}^-_x \leftarrow 0$, $n^+_x \leftarrow n^-_x \leftarrow 0$
  repeat
    for all $x \in X_{n-k}$ do
      compute $J_{\mathcal{L}}(X_{n-k} \setminus \{x\})$
    end for
    {Update accumulators and counters}
    for all $x \in Y$ do
      if $x \in X_{n-k}$ then
        $\mathcal{L}^+_x \leftarrow \mathcal{L}^+_x + \sum_{y \in X_{n-k} \setminus \{x\}} J_{\mathcal{L}}(X_{n-k} \setminus \{y\})$
        $n^+_x \leftarrow n^+_x + |X_{n-k}| - 1$
        $\mathcal{L}^-_x \leftarrow \mathcal{L}^-_x + J_{\mathcal{L}}(X_{n-k} \setminus \{x\})$
        $n^-_x \leftarrow n^-_x + 1$
      else
        $\mathcal{L}^-_x \leftarrow \mathcal{L}^-_x + \sum_{y \in X_{n-k}} J_{\mathcal{L}}(X_{n-k} \setminus \{y\})$
        $n^-_x \leftarrow n^-_x + |X_{n-k}|$
      end if
    end for
    $x' \leftarrow \arg\max_{x \in X_{n-k}} \left\{ \frac{\lambda}{2} \left( \mathcal{L}^+_x / n^+_x - \mathcal{L}^-_x / n^-_x + 1 \right) + (1 - \lambda)\, J_{\mathcal{L}}(X_{n-k} \setminus \{x\}) \right\}$
    $X_{n-k-1} \leftarrow X_{n-k} \setminus \{x'\}$
    $k \leftarrow k + 1$
  until $k = n$
  return $\arg\max_{k = 1, \dots, n} J_{\mathcal{L}}(X_k)$

4 Experimental work

Experimental work is now presented in order to assess the described
modifications using two sequential algorithms: SBG and its accumulated
counterpart SBG+. The algorithms were implemented using the R language
for statistical computing [R Development Core Team, 2008].

5 Experimental settings

Each full experiment consists of an outer loop of 5x2 cross-validation
(5x2cv) for model selection, as proposed by several authors
[Dietterich, 1998], [Alpaydin, 1999]. This procedure performs 5
repetitions of a 2-fold cross-validation. It keeps half of the examples
out of the feature selection process and uses them as a test set to
evaluate the final quality of the selected features. For every fold and
repetition of the outer cross-validation loop, two feature selection
processes are conducted with the same examples, one with the original
algorithm (SBG) and one with the accumulated version (SBG+).

Each feature selection iteration uses the 1-nearest-neighbor learner
implementation in [Venables and Ripley, 2002] (which uses Euclidean
distance), linear discriminant analysis (LDA) and the Support Vector
Machine with radial kernel (SVMr). The parameters of the SVM (the
regularization constant or cost and the kernel width) are kept fixed to
their default values in all the experiments, since we are only
interested in the influence that different feature subsets have on the
modelling.^3

The evaluation of these inducers is resampled in a second (inner) 5x2cv
loop for a more informed estimation of usefulness. In all cases,
stratification is used to keep the same proportion of class labels
across the partitioned sets. After some preliminary experiments, we set
$\lambda$ = [...] in expression (10). It is very important to mention
that there is no stopping criterion in the algorithms: the two backward
methods run until all the features have been removed. Then the best
subset in the obtained sequence of subsets is returned. This setting
avoids the specification of an a priori size for the solution. It also
eliminates the possibility that the accumulated algorithm performs
differently simply because it merely influences the stopping point.

Once the best feature subset is found (a different 
one in every outer loop) , this subset is evaluated in 
the corresponding test set. The final test error (the 
one reported) is the mean of these 10 values. 
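The outer resampling loop can be sketched as follows: a minimal, standard-library-only illustration of stratified 5x2 cross-validation, not the authors' R code. The function name `stratified_5x2_splits` is hypothetical; the class sizes are those reported for the Colon Tumor data.

```python
import random
from collections import defaultdict

def stratified_5x2_splits(labels, rng):
    """Yield 10 (train_idx, test_idx) pairs: 5 repetitions of a stratified 2-fold split."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(5):
        half_a, half_b = [], []
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2  # split each class in two to keep proportions
            half_a.extend(shuffled[:mid])
            half_b.extend(shuffled[mid:])
        # Each half serves once as training data and once as test data.
        yield sorted(half_a), sorted(half_b)
        yield sorted(half_b), sorted(half_a)

labels = ["tumor"] * 40 + ["normal"] * 22  # class sizes of the Colon Tumor data
splits = list(stratified_5x2_splits(labels, random.Random(0)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Feature selection would be run on the training half of each pair only, and the resulting subset scored on the matching test half, giving the 10 values whose mean is reported.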

5.1 Benchmarking microarray data 
sets 

In a microarray gene expression context, there is a wide spectrum of
FSS algorithms. Commonly found methods fall into the filter category: a
list of the top-ranked genes based on some inducer-free figure of merit
is generated, followed by an inductive process where a classifier is
incrementally evaluated [Ruiz et al., 2006]. This constitutes a fast
and low-complexity approach. However, considering individual
contributions only can hinder the discovery of possible interactions
between genes.

Many authors have claimed that the wrapper approach, if affordable, is
preferable to the filter approach (e.g. [Liu and Motoda, 1998],
[Kohavi and John, 1997]). It is therefore of the greatest importance to
make the most of every evaluation of the inducer, which is normally the
most costly part.

Validation of the described approach uses five public-domain microarray
gene expression data sets, briefly described as follows:

1. Colon Tumor: Used originally by [Alon et al., 1999], it consists of
   62 samples of colon tissue, of which 40 are tumorous and 22 normal,
   and contains 2,000 genes.

2. Leukemia: Used first by [Golub et al., 1999], the training set
   consisted originally of 38 bone marrow examples (plus a further test
   set with 34 examples). This set of examples has been merged to form
   a data sample of 72 examples, which are described by 7,129 probes:
   6,817 human genes and 312 control genes. The goal is to tell acute
   myeloid leukemia from acute lymphoblastic leukemia.

3. Lung Cancer: Studied by [Gordon et al., 2002], the problem consists
   in distinguishing between malignant pleural mesothelioma and
   adenocarcinoma of the lung. There are 181 examples available,
   described by 12,533 genes.

4. Prostate Cancer: This data set was used by [Singh et al., 2002] to
   analyze differences in pathological features of prostate cancer and
   to identify genes that might anticipate its clinical behavior. There
   are 181 examples and 12,600 genes.

5. Breast Cancer: [Veer et al., 2002] studied 97 patients with primary
   invasive breast carcinoma; 24,481 genes were analyzed.

3 These values are 1 for the cost parameter and the inverse of the
number of features for the smoothing parameter in the kernel.

These problems are hard for several reasons, in particular the sparsity
of the data, the high dimensionality of the feature (gene) space, and
the fact that very many features (the genes) are irrelevant or
redundant. In these situations, performing feature selection is at best
a delicate task that entails a very high risk of overfitting, even when
the full set of features has been preprocessed to lower the
dimensionality of the problem.
We made a preliminary selection of genes on the basis of the ratio of
their between-groups to within-groups sum of squares, as in other
approaches, to make a wrapper approach computationally feasible
[Dudoit et al., 2002]. In this work, the top 200 genes for each dataset
were selected as the source of study. It is important to stress that
there has been little effort to find the best models among those
represented by the considered inducers: in other words,
nearest-neighbors is limited to just one neighbour and the SVM
parameters have been set to their default values. All the effort is
devoted to finding good feature subsets and to comparing the two search
algorithms in similar experimental circumstances.
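The pre-selection criterion mentioned here, the ratio of between-groups to within-groups sum of squares, can be sketched in plain Python as follows. The function names `bss_wss_ratio` and `top_genes` and the tiny expression matrix are made up for illustration; the paper's own preprocessing code is not shown in the text.

```python
def bss_wss_ratio(values, labels):
    """Between-groups to within-groups sum of squares for one gene."""
    overall = sum(values) / len(values)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    bss = wss = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        bss += len(members) * (mean - overall) ** 2
        wss += sum((v - mean) ** 2 for v in members)
    # A gene that is constant within every class has no within-group spread.
    return bss / wss if wss > 0 else float("inf")

def top_genes(matrix, labels, k):
    """Indices of the k genes (columns) with the largest ratio."""
    n_genes = len(matrix[0])
    ratios = [bss_wss_ratio([row[j] for row in matrix], labels) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: ratios[j], reverse=True)[:k]

# Made-up expression matrix: 4 samples x 3 genes; gene 1 separates the classes.
matrix = [[1.0, 0.0, 5.0],
          [2.0, 0.1, 4.0],
          [1.5, 3.0, 5.5],
          [2.5, 3.1, 4.5]]
labels = ["a", "a", "b", "b"]
print(top_genes(matrix, labels, 1))
```

Because each gene is scored on its own, this filter is cheap, but it cannot see interactions between genes; that is the role left to the wrapper search on the retained 200 genes.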

For comparative purposes, performance results using the whole set of
features and the reduced subset of 200 features are displayed in
Table 1. In view of these results, it is clear that these subsets
constitute a very good starting point for further analysis with wrapper
methods.



Problem            1NN             LDA             SVMr
                   Y      X200     Y      X200     Y      X200
Colon Tumor        23.9   23.2     24.8   20.0     31.0   14.8
Leukemia            9.7    8.3     14.1    3.1     26.7    2.8
Lung Cancer         1.8    2.0     N/A     1.8      4.4    1.0
Prostate Cancer    23.4   19.1     N/A    25.5     38.2   26.9
Breast Cancer      45.1   27.7     N/A    24.5     48.3   24.1

Table 1: Average test error (in %) for the different inducers in the
preprocessing phase. Y: using the full set of genes; X200: using the
top pre-selected 200 genes; N/A: computation unaffordable due to
numerical inaccuracies in LDA.



Problem            1NN             LDA             SVMr
                   SBG+   SBG      SBG+   SBG      SBG+   SBG
Colon Tumor        18.1   20.0     19.0   22.2     18.1   18.7
Leukemia            8.1   10.9     16.7   17.7      7.8    9.2
Lung Cancer         3.3    3.4      2.7    3.4      3.4    3.5
Prostate Cancer    14.0   15.5     24.8   26.4     21.9   22.0
Breast Cancer      26.2   29.3     27.4   36.7     23.7   25.6
Average            13.9   15.8     18.1   21.3     15.0   15.8

Table 2: Average test error (in %) for the different inducers when
comparing SBG+ to SBG.



Problem            1NN             LDA             SVMr
                   SBG+   SBG      SBG+   SBG      SBG+   SBG
Colon Tumor        37.4   73.8     70.5   79.2     15.5   14.2
Leukemia            7.2   28.3     30.0   32.5      6.1   37.2
Lung Cancer        17.4   20.0      4.1   13.4      4.5    8.8
Prostate Cancer    18.3   19.3     23.5   44.3     12.9    8.1
Breast Cancer      60.2   34.2     22.4   52.6     13.0   17.5
Average            28.1   35.1     30.1   44.4     10.4   17.2

Table 3: Average gene subset sizes for the different inducers when
comparing SBG+ to SBG.

6 Discussion 

The results of the FSS process are displayed in Tables 2 and 3. The
first table shows the (cross-validated) average test error for the two
algorithms and the different inducers. The second table shows the
(cross-validated) average size of the final selected subsets.

The first fact to note is that the accumulated version outperforms the
standard version (though in general by a modest margin) in all cases.
This is a very remarkable result, given the big differences among the
problems and among the inducers. Second, SBG+ finds in general
solutions of lower size than SBG does, sometimes by a substantial
amount (e.g., 1NN in Colon Tumor and Leukemia, most of LDA, or Leukemia
and Lung Cancer with the SVM). Given that there is no stopping
condition, our explanation is that the standard backward version is
greedier than the accumulated one. By the (early) inclusion of some (or
many) features that are not as good as they look at that moment, and
cannot be removed, SBG is driven toward worse local minima of the error
function as compared to SBG+. The greediness itself is explained by the
purely local (in the temporal sense) character of SBG and it also
explains the worse prediction results of this algorithm.

Feature selection appears to be a viable avenue for dimensionality
reduction in this field: a reduction of two orders of magnitude in the
number of features by univariate methods shows substantial improvements
(Table 1). With a further reduction of another order of magnitude, mean
performance of the finally selected classifiers is similar to that
achieved using the previously reduced subset. This behavior is
important, both for computational and scientific reasons. Even without
optimization of free parameters (a necessary step in normal
conditions), cross-validated wrapper computations with 200 features may
take several days of computing time on a modest machine.
Scientifically, coping with hundreds of features while expecting to
interpret the role of every feature in the model is out of the question
in many cases. This is aggravated in the present situation of data
scarcity.

The results diverge for different classifiers, as may reasonably be
expected. This is of the greatest importance when assessing whether an
improvement is consistent, or is limited to a certain type of method.
In this sense, 1NN seems to be the best method for Prostate Cancer, LDA
for Lung Cancer and the SVM for the other three (in all cases using
SBG+). The SVM tends to deliver smaller gene subsets, both for SBG and
SBG+. Given that the SVM parameters were not optimized beyond educated
guesses, we think there is room for further improvement in the
modeling, especially on the accuracy side.

Comparison to other results in the literature us- 
ing the same data sets is a delicate undertaking 
in general. The methodological steps can be very 
different, especially concerning resampling tech- 
niques. We have found that many times there are 
no true test sets: feature subsets or model parame- 
ters (or both) are optimized by means of one or 
several resampled runs of cross-validation. This 
procedure is dangerous in that it cannot deliver 
an unbiased estimation of true error, given that, 
although test observations have not been used to 
create the model, they have been used to decide 
upon competing ones (namely, in the feature selec- 
tion process itself). The stability of these results 
is also compromised if only one resample is carried 
out. On the other hand, the delivered gene subset 
size is a very important issue to bear in mind, if 
the solutions are to become interpretable and use- 
ful from the clinical point of view. That said, we 
compare with several references illustrative of re- 
cent work: 

1. For the Colon Tumor data set, [Wang et al., 2008] report an error of
   12.7% with 94 genes, while [Bu et al., 2007] report an error of
   23.0% with 33 genes, both using radial SVMs. For this dataset, we
   report a test error of 18.1% using an average of 15 genes.

2. For the Leukemia problem, [Bu et al., 2007] report an error of 4.0%
   with 30 genes using a radial kernel, and an extraordinary 1.4% is
   reported using only two genes and filter methods for ranking
   [Hewett and Kijsanayothin, 2008]. For this dataset, we report an
   average test error of 7.8% using an average of 6 genes.

3. The Lung Cancer data set is apparently the easiest to separate.
   Accuracy values as high as 99% are achieved by [Bu et al., 2007]
   (using an SVM and 38 genes) and by [Hong and Cho, 2008], this time
   using 5NN and as many as 135 genes. For this dataset, we report an
   average test error of 2.7% using an average of 4 genes.

4. In the Prostate Cancer problem, an error as low as 7% has been
   reported (half our best result) using a radial SVM and 47 genes
   (nearly three times our result) [Bu et al., 2007].

5. Finally, for the Breast Cancer problem, an error of 21% is reported
   using a radial SVM and 46 genes [Bu et al., 2007], and an error of
   32% using again an SVM and 8 genes
   [Hewett and Kijsanayothin, 2008]. For this dataset, we report an
   average test error of 23.7% using an average of 13 genes.

7 Conclusions 

This paper has presented a modification suitable for feature subset
selection algorithms that iteratively evaluate subsets of features, by
making them accumulate all the "log of merit" of the features in quite
different contexts. The idea is that the current subset evaluation is
not used directly to select the feature to add (or remove), but to
accumulate information on the usefulness of the feature in many
contexts. The different contexts of a particular feature x are given by
all those subsets that contain x (they express how good it is to have
x) and those that do not contain x (they express how good it is not to
have x). The accumulated information is then used to decide which
feature should be added or removed (namely, that feature with the
highest (lowest) accumulated usefulness which has not yet been added
(removed)). Therefore, the search history influences the search itself,
conditioning the selection of features. This view is consistent with
the definition of a search algorithm as a mapping from its history
(including its present state) to the set of possible moves. In these
conditions, less importance is assigned to the current subset
evaluation than in a classical FSS setting (where it is the only source
of information). Our experimental results indicate a general
improvement in performance, without any additional modelling effort.

Future work includes exploring SFG. The decision to study SBG in the
first place is consistent with the goal of discovering feature
interactions. Having all the features from the beginning greatly
facilitates this task. Nonetheless, the more modest computational
demands that SFG entails in practice (if cut before exhaustion of
features) may be an appealing characteristic. It is relevant to point
out that the presented algorithmic modification may be of little help
if an algorithm has many opportunities to rectify its decisions (e.g.,
the PTA(l, r) family of algorithms). However, even in this case, the
forward or backward steps will be more informed, possibly making the
search algorithm deliver better solutions at earlier stages.
Unfortunately, the $O(n^{l+r+1})$ cost of PTA(l, r) can well make it
prohibitively high for microarray data problems in wrapper mode.

A clear avenue for further research is the setting of the free
parameter, $\lambda$. It is our conjecture that an adaptive value may
deliver better results. In this sense, the influence of past
evaluations may be different at early or late stages of a search
process.